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Abstract 

Background: On the pretext that sequence reads and contigs often exhibit the same kinds of base usage that is 
also observed in the sequences from which they are derived, we offer a base composition analysis tool. Our tool 
uses these natural patterns to determine relatedness across sequence data. We introduce spectrum sets (sets of 
motifs) which are permutations of bacterial restriction sites and the base composition analysis framework to 
measure their proportional content in sequence data. We suggest that this framework will increase the efficiency 
during the pre-processing stages of metagenome sequencing and assembly projects. 

Results: Our method is able to differentiate organisms and their reads or contigs. The framework shows how to 
successfully determine the relatedness between these reads or contigs by comparison of base composition. In 
particular, we show that two types of organismal-sequence data are fundamentally different by analyzing their 
spectrum set motif proportions (coverage). By the application of one of the four possible spectrum sets, 
encompassing all known restriction sites, we provide the evidence to claim that each set has a different ability to 
differentiate sequence data. Furthermore, we show that the spectrum set selection having relevance to one 
organism, but not to the others of the data set, will greatly improve performance of sequence differentiation even 
if the fragment size of the read, contig or sequence is not lengthy. 

Conclusions: We show the proof of concept of our method by its application to ten trials of two or three freshly 
selected sequence fragments (reads and contigs) for each experiment across the six organisms of our set. Here we 
describe a novel and computationally effective pre-processing step for metagenome sequencing and assembly 
tasks. Furthermore, our base composition method has applications in phylogeny where it can be used to infer 
evolutionary distances between organisms based on the notion that related organisms often have much 
conserved code. 



Introduction and related work 

During a DNA sequencing task, the nucleotides of the 
reads or contigs must be placed in the correct order to 
reconstruct the original sequence. This sequencing task 
is particularly challenging when working with a metage- 
nomic task, which requires one to gather and order simi- 
lar sequences from a number of different organisms. This 
metagenomic technique has been extensively discussed 
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in [1,2] and a framework to infer phylogenetic relation- 
ships (patterns) among assemblages of microorganisms 
has been developed [3]. This approach is expected 
to help improve assembly projects by reducing search 
spaces when grouping related sequence fragments. 
Massively parallel next-generation sequencing technolo- 
gies (a major technological rebirth of the former Sanger 
methods of the 1980's [4]) provide ultrahigh throughput 
results at a low cost but the reads are often too short to 
be able to determine their adjacency. In [5], the authors 
describe a novel method for de novo assembly of large 
genomes from short read sequences which they used to 
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assemble two giant genomes: the Asian and African 
human genome sequences. 

Some of the limitations encountered in the assembly 
process include read coverage and size. The absence of 
placement information such as read coverage forms a 
bottleneck in the reassembly process [6,7]. When the 
read sequences are very short, then special procedures 
must be taken to maximize their informational content 
to achieve placement evidence. For this work, it may be 
necessary to form contigs by de novo assembly methods 
as in [8]. Despite these limitations, technologies such as 
Velvet and Oases have been used for many genome 
assembly projects [9,10] and [11]. Assembling reads 
using approaches from probability theory, or from the 
memory-based, are gaining popularity. This was deter- 
mined by Zhang et. al. [12] who compared the perfor- 
mance of eight distinct tools (i.e., SSAKE, VCAKE, 
QSRA, SHARCGS, Edena, Velvet, SOAPdenovo, and 
Taipan) against eight groups of simulated datasets. 

In metagenomic studies, where there are different 
kinds of reads or contigs mixed together into the same 
pool, the task of separating them back into n-distinct 
groups, becomes an NP-hard problem. Although a 
researcher may choose to determine their order using 
some computational tools, as described in Figure 1, this 
still is an NP-hard problem to separate the sequence 
data. 

Furthermore, this difficulty of separating the reads 
may prevent the assembly tools from ever being used 
optimally. In [13], the authors discuss the problem of fil- 
tering the reads or contigs into smaller groups for better 
management. Time and productivity can be saved by 



these pre-processing steps where related sequence mate- 
rial is placed into a bin (here called, binning) to reduce 
search spaces for reconstructing entire sequences or 
genomes. It is therefore important to perform efficient 
binning steps to save costs in the sequencing task to 
reduce the work-load in an assembly project. 

Chromosomal material across different genera were 
organized into species-specific groupings by virtue of the 
motif composition which was contained in the DNA [14]. 
In our study, we present a similar framework of organizing 
samples of DNA by their motif content. Our method dif- 
fers from the authors' work, however, because it could be 
applied to smaller sequence fragments than chromosomes 
and it also employs motifs of similar base-composition to 
associate (e.g., bin) sequences of different organisms into 
related groups. Our set of motifs are biologically relevant 
since they were derived from known bacterial restriction 
sites. We permuted the base composition of the bases 
found in a particular restriction site to generate a list of all 
possible motifs of the same composition. Here, all the 
motifs belonging to a set of the same base composition is 
said to form a spectrum set. We show that an organism's 
recognition sequence belongs to only one of the four pos- 
sible non-palindromic spectrum sets. Furthermore, each 
set must be strategically selected for successful sequence 
binning. 

Our hypothesis is that a restriction site base composi- 
tion algorithm can be used to separate and bin the 
sequence material from several different organisms. Our 
method compares the spectrum set motif proportions 
between sequences and uses this knowledge to separate 
them. For instance, if the motifs have similar proportions 




Figure 1 Sequence fragments are separated into groups (called/'bins") of relatedness by a quick pre-proccessing step. This graphic taken from 
our previous work in [33]. 



Bonham-Carter et al. BMC Bioinformatics 2013, 14(Suppl 1 1):S5 
http://www.biomedcentral.eom/1 471 -21 05/1 4/S1 1 /S5 



Page 3 of 14 



across two sequences, then there is evidence to suggest 
that the sequences are related to each other in some way. 
Here, this relation is called an association. In summary, 
our work stands apart from the traditional assembly pre- 
processing methods found in wet-labs since our method 
relies on statistics alone to find likenesses across 
sequence material to discover associations and bin the 
sequence data. 

Methods 

Genome sequences 

The genomic DNA sequences were studied from six dif- 
ferent phylogenetic groups (Actinobacteria, Firmicutes 
and Proteobacteria) shown in Table 1. The genomes 
and chromosomes for this study were downloaded from 
Genbank (a public and international online database) 
and were manipulated with tools which we describe 
below. All sample genomes were at least 1 Mb in length. 
There is evidence to suggest that GC-groups have a ten- 
dency to mutate to AT groups [15,16]. Furthermore, it 
is thought that similar GC composition implies similar 
genomic structure [17]. In light of this knowledge, our 
analysis was drawn from bacteria comprising many 
raised and lowered levels of GC-content. 

In each experiment, ten trials of five freshly drawn 
reads were studied. We used MetaSim [18] to create 
artificial contigs (or reads) similar to those of an actual 
assembly project. Each set of contigs (or reads) was 
extracted from the Contig Originator organism of 
Table 1. We applied all four spectrum sets to deter- 
mine the proportional distributions used for the leaf 
weights in our heatmap trees. We placed randomly 
selected contigs in each test. There were also several 
other related organisms added to each pool to test and 
further determine the association behaviors. 

We found very similar trends in each division. We illus- 
trate them by discussing the arbitrarily chosen the organ- 
isms Staphylococcus and Clostridium of the Firmicutes 
division. The results from the other divisions featured in 
Table 1, Proteobacteria and Actinobacteria, were very 
similar to the findings of the Firmicutes. 



Table 1 Genera for the study 


Organism 


Contig Originator 


Division 


Bifidobacterium longum 


NC_004307 


Actinobacteria 


Mycobacterium bovis 


NC_002945 


Actinobacteria 


Clostridium tetani 


NC_004557 


Firmicutes 


Staphylococcus aureus 


NC_007622 


Firmicutes 


Burkholderia pseudomallei 


NC_012695 


Proteobacteria 


Campylobacter jejuni 


NC_008787 


Proteobacteria 



This table displays the genera used in our study. The Read Originator column 
displays the sequences which we processed via MetaSim for its reads. To 
determine their general associative behaviors, we studied ten trials of five 
freshly drawn reads. We chose two organisms from each of the three divisions 
from which to draw our contigs. 



Read and contig sequences 

The synthetic data was made up of shorter reads of less 
than 1 kbp and were generated utilizing the 454 frame- 
work that was offered by the MetaSim software took 
MetaSim selects its reads by a statistical approach 
according to user input. The software simulates the 
approaches of both Sanger sequencing and Roche's 454 
(sequencing-by- synthesis). The maximum allowed length 
of contigs by MetaSim is 1 Kbp and so the longer reads 
or contigs for this study (1 Kbp 30 Kbp) had to be gen- 
erated by our own tool, which also follows the 454 
(sequencing-by-synthesis) method. We created longer 
reads or contigs of lengths 2 kbp, 5 kbp, 10 kbp and 30 
kbp for an exhaustive study using this tool. Although it 
may appear that some of these reads are unnaturally 
long, we note that the typical lengths of reads appear to 
be growing as the sequencing technology improves and 
evolves. 

In our experiments, we ran binning tests containing 
many reads or contigs but due to redundancy in the out- 
come of the analysis, our tests required only about five to 
ten reads or contigs to display the relevant trends. This 
small set of sequence data was acceptable to our work 
because we often observed that nearly all of the reads of 
a larger set had very similar distributions of motifs con- 
tent from the spectrum sets. 

Motifs 

REBASE [19] is an online database of information con- 
cerning bacterial restriction enzymes and their recognition 
sites. Each of the organisms (Campylobacter, Burkholderia, 
Bifidobacterium, Mycobacterium, Clostridium, Staphylo- 
coccus) were queried at REBASE for their organism-speci- 
fic, palindromic recognition site sequences of length-6. 
This length was desirable for our work because (1) it is a 
common size in bacteria, mitochondria and plasmids and, 
(2) it is statistically interesting. For example, let A be the 
size of the DNA alphabet {A, C, G, T} (four elements) and 
let L be the motif length. There are = 4! = 4 3 = 54 
possible palindromic sites available from the set of all pos- 
sible length-6 words, A L = 4 6 = 4096. When compared to 
the seemingly spontaneous occurrence rates of the shorter 
motifs, these longer words are less likely to be random 
occurrences along the genome. 
Base compositions and spectrum sets 
There are usually several uniquely spelled, palindromic 
recognition sequences of length-6 for each bacterial organ- 
ism according to REBASE. For example, Clostridium has 
eleven recognition sequences (Figure 2), and Staphylococ- 
cus (Figure 3) has only four. It is typically rare to find 
common recognition sequences between two organisms 
however, in this case, ATGCAT is common to both. Con- 
sulting REBASE, we found all the known restriction sites 
and placed each into one of four unique sets according to 
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Figure 2 The spectrum set taken from the four restriction sites of the Clostridium genera. There are ten unique recognition sites covering all 
four spectrum sets (shown in Figure 4). This graphic taken from our previous work in [33]. 



their DNA compositions. In Figure 4, we show this group- 
ing of all restriction sites. We call these sets, spectrum sets 
where each element of a set contains the same count of 
each base. We name each set by the following motifs: 
AAATTT, AATTCG, CCGGAT and CCCGGG, For exam- 
ple, the motifs, ATTTAA, AATTTA, TAAATT and 
TTAAAT, are all elements belonging to the AAATTT - 
spectrum set. A DNA word, w is palindromic if w = = 
reversed[Complemented{w)]. We do not consider palin- 
dromes in our spectrum sets, although many of the 
restriction sites of restriction modification systems are 
naturally palindromic, since they are thought to be avoided 
in the genome [20]. This avoidance property may confuse 
our results since we are investigating their occurrences in 
a sequence. Table 2 lists the sizes of each set. 

Proportions 

We use proportions, not frequencies, in our study of 
motif content because we are only using a subset of 
the set of all possible motifs of length-6. We ignored 



overlapping palindromes (no nested motifs) in the 
sequences for simplicity. The motif occurrence data in 
the sequences was normalized to make the compari- 
sons meaningful. We determine the proportionality for 
each motif in a set across a genome by the following 
Equation 1: 
The proportion of, 

. count(mi) * \mi\ 
m in S L = — (1) 

where m t is a motif, S L is a sequence fragment (a read, 
contig or genome), count(mi) represents the number of 
occurrences of m t found in S L , and \m t \ and \S L \ are the 
lengths of the motif and the sequence, respectively. For 
each motif in a spectrum set, the proportion of sequence 
that is made up of the motif is calculated by this equation. 
For each spectrum set, a vector is created from all propor- 
tions to be applied to a clustering analysis by hclust: a com- 
mand in the R Statistical software [21]. The result of the 
analysis is a heatmap [22] to determine the associations. 
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We used the motif proportions, to make vectors from 
each sample sequence. Comparing the vectors across 
the organisms determined likenesses and relatedness. If 
the vectors of the spectrum set motifs were similar 
between sequences, then this may have been an indica- 
tion of much common DNA between both sequences. 
This may also suggest a degree of relatedness between 
the organisms. Since a contig comes from a sequence, 
then the contig and the sequence will both share all 
their DNA and so our analysis will locate these similar- 
ity patterns and bin them together. Our analysis code 
was written in Python and soon will be made available 
to the bioinformatics community by our software repo- 
sitory located at [23]. In Figure 5, we provide a sum- 
mary of how our method is applied. 

Results and discussion 

According to their proportions of motif content, the 
clustering in heatmaps describes a tree of relatedness 
between the organisms. Similar proportions between the 



sequences are indicated by their close proximity in a 
subtree of the main tree of relatedness. A parent 
sequence is one which is closely related to the sequence 
from which the reads or contigs were derived. Since 
these fragments may contain large regions of common 
code with parent sequence(s), they will associate with 
them and will be found in its subtree in our heatmaps. 
By association, we imply that there is ample evidence to 
suggest that the reads or contigs are more similar to their 
parent(s) than any other genome in the tree of related- 
ness. We, furthermore, suggest that these fragments 
make up a sequence that is related to the parent (s). This 
property can be utilized to create bins from which to 
begin assembling each sequence in the reassembly task. 

Sequence data 

In the following, we discuss the task of binning long 
reads or contigs. Here, we choose to use DNA strands 
of a length 5000 bps. These strands shall be contigs for 
the purpose of describing the tool that manipulates 
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Figure 4 From its base composition, each bacterial restriction site fits into only one of the four spectrum sets, featured by unique color patterns. 
The motifs of each set are made up by the permutations of one of the following words; AAATTT, AATTCG, CCGGAT or CCCGGG. This result taken 
from our previous work in [33]. 



them. Our method is a tool to determine the propor- 
tions of motifs occurring in sequence data. The tool 
requires enough information from each strand to make 
correct decisions about relatedness and if there is an 
insufficient amount of sequence material for compari- 
sons to others, then our base composition tool will 
make poor determinations. Sequence fragments of 700 
bps were often enough to show the trends we discuss in 
this paper, but we found some errors. We found that 



Table 2 Members of the spectrum sets 


Set Seed 


Available Motifs 


AAAm 


12 


CCCGGG 


12 


AATOG 


156 


CCGGAT 


156 



The numbers of available motifs belonging to each spectrum. The motifs in 
the spectrum set are nonpalindromic and are permutations of the set seeds. 
The set created from the permutation of AATTCG is called, the AATTCG- 
spectrum, for example. 



longer sequence data provided clearer and more accu- 
rate results due to having enough base information 
upon which our method relies. 

This size of sequence material may seem large if the 
sequences were reads and not contigs. However, we note 
that sequencing and assembly technologies appear to con- 
tinually create longer reads than previous technologies. 
Very large sizes may soon be a reality since read pre-pro- 
cessing methods and various read alignment technologies 
are already being used to create larger contigs [24-29]. 

Clostridium and Staphylococcus 

Clostridium and Staphylococcus typify the kinds of phe- 
nomena we observed after of ten trials of each experi- 
ment, using the arbitrarily selected pairs of organisms 
from Table 1. Here will describe the typical kinds of 
observed phenomena using spectrum sets on these two 
organisms. We will begin by showing that the two genera 
groups, Clostridium and Staphylococcus, are unrelated by 
the analysis of their motif proportions. We note from 
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Figures 3 and 4 that only Clostridium, having the recog- 
nition sites ATTAAT and CCCGGG, can be discrimi- 
nated by the AAATTT and CCCGGG-Spectrum sets 
(Staphylococcus does not have restriction sites of this 
composition). By our analysis of motif proportions of this 
spectrum set, we see that both organisms have very dif- 
ferent proportions of these spectrum sites. 

We note from Figures 6 and 7 that there were two 
clearly contrasted subtrees in the heatmap to separate the 
two organisms. There was similar contrast between the 
sequences of our other heatmaps of the other organisms. 
In the present two organisms, we noted that the heatmaps 
are nearly opposite from each other: the Clostridium 
family members tend to have warmer colors (elevated pro- 
portions) and the Staphylococcus members have colder 
colors (low proportions) in the AAA TIT-spectrum set. 
This trend is the inverse for the CCCGGG-spectrum set. 

The AAITCG-spectrum set was also successful in show- 
ing two different family subtrees but there was much less 
apparent contrast between the organisms than there was 
when using the AAA TIT-spectrum set. We attribute this 
high contrast to the phenomenon that a spectrum set may 
perhaps be more biologically relevant to one of the organ- 
isms than the other, according to their recognition 
sequence usage. The CCGGA T-spectrum set was not typi- 
cally very successful in showing contrasts for binning in 



our trials for these organisms. This same experiment was 
performed ten times with different (i.e., newly selected) 
contigs and we observed similar results in the heatmaps as 
those discussed. We suggest that since the Staphylococcus 
group appears to have a higher proportion of CCCGGG 
content than Clostridium, this contrast helps to associate 
the reads by relations. 

It is clear that the proper use of the correct spectrum set 
can neatly differentiate one organism group from another 
for binning. Above, we saw that there are differences in 
the amounts of the spectrum sets which are found in the 
organisms. This made a high contrast which helped to 
determine one organism from another. We will now dis- 
cuss how this method can discriminate between only read 
or contig sequence data. 

Proportional differences in contigs by spectrum sets 

We shall now discuss an application of separating reads 
originating from three different organisms that have been 
mixed together into the same pool. Incidentally, a part of 
this process comprises the separation of contigs belonging 
to two different organisms. For our test, we arbitrarily 
selected another organism (featured in our organism 
group in Table 1) Burkholderia pseudomallei to be added 
to the contigs from Clostridium tetani and Staphylococcus 
aureus. The contigs are of length 5000 bps which we 
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chose to illustrate the test and to showcase its 
performance. 

Tests to determine pliable spectrum sets 

When working with the contigs of two organisms, a 
spectrum set could be selected based on the restriction 
sites which are inherent to the involved organisms. 
However, a sequencing task may combine contigs of 
three or more organisms together. The contigs of each 
organism will have to be separated from those of the 
other organisms to make the sequence assembly more 
feasible. Due to the large number of contigs in the 
whole project, it may not be convenient to run a base 
composition analysis over all sequence data and so, to 
determine the spectrum set for the binning task, it is 
suggested to use the spectrum set test as shown in 
Figure 8. This test is a base composition analysis taken 



only across the organisms who are known to be close 
relatives of the contigs (parents) in the pool. In Figure 8, 
we note that Burkholderia has the lowest proportions of 
the AAA riT-spectrum set. Conversely, in Figure 9, Sta- 
phylococcus and Clostridium have the lowest propor- 
tions of the CCCGGG-spectrum set. When either of 
these spectrum sets are applied to the pool of all con- 
tigs, we note that the Burkholderia, Staphylococcus and 
Clostridium contigs reflect the same trends observed at 
the genome-level. For instance, Figures 10 and 11 reflect 
the underlining trends of Figures 8 and 9, respectively, 
in terms spectrum set motif coverage. 
Removal of the contrasting contig group 
In Figure 8 (spectrum set AAATTT ), we noted that 
Burkholderia had low proportions of this set, and also 
in Figure 9 (spectrum set CCCGGG } the opposite was 
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Figure 8 The /W\77T-Spectrum set test. The sequence data is applied to our base composition analysis to determine its relatedness. 




Figure 9 The CCCGGG-S pectrum set test. The sequence data is analyzed by base composition to determine relatedness. 




Figure 10 The /\/\/\77T-spectrum set analysis taken across all sequence data in a pool. The Burkholderia pseudomallei sequence data, having 
elevated proportions of the motifs of this spectrum set, create a contrast from those of Clostridium tetoni and Staphylococcus aureus. These 
organisms are observed to have mixed proportions by this heatmap. 
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Figure 11 The CCCGGG-spectrum set analysis taken across all the contigs in the pool. We note that the Burkholderia pseudomallei sequence 
data, having low proportions of the motifs of this spectrum set, create a contrast from those of Clostridium tetani and Staphylococcus aureus. 
These organisms are observed to have mixed proportions by this heatmap. 
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true. In Figures 10 and 11, we see that the Burkholderia 
contigs also show this same pattern. Therefore, by this 
strong contrast, we could remove all contigs which 
show these strong contrasts and in doing so we would 
likely be binning the Burkholderia contigs. We note that 
the spectrum set AA TTCG was unable to to show con- 
trasts between two of three organisms (Figure 12) but 
Burkholderia was still a contrasting group. Interestingly, 
without this organism, the AATTCG spectrum set 
clearly differentiated Staphylococcus and Clostridium 
contigs as shown in Figure 13. This suggests that the 
addition of Burkholderia (having such low proportions 
of the spectrum set motifs) to the set may change the 
parameters of the heatmap software. 

Phylogeny from full chromosomes 

To demonstrate its ability to differentiate sequence data 
into biologically relevant groups, we show that our 
method is able to form phylogenetic trees which con- 
form to NCBFs taxonomy tool [30]. In our example, we 
arbitrarily selected a chromosome from each of seven 
diverse organisms listed in Table 3. We then applied 
our framework to extract the distributions of each spec- 
trum set and compared the results to the taxonomy tree 
in Figure 14 from NCBI which is based on the classifica- 
tion of their taxonomy database and other resources. 

We remind the reader that the subtrees in this exam- 
ple contain organisms that may be related by basic evo- 
lutionary phenomena. If we had contigs in the pool 
from each of these organisms, then these fragments 



would associate to form more specific family subtrees. 
Instead, this data is chromosomal sequence material 
which group by relatedness. 

We inspected the resulting trees of this example with 
the following criteria: the bacterium should be the most 
evolutionarily distinct organism. The mammals (i.e., the 
dog, rabbit, rat and mouse) should be the most evolutio- 
narily similar group of the set. The worm and the fruit 
fly should be found in a subtree which is evolutionarily 
between the bacterium and the mammals. Indeed, the 
worm and the fruit fly are quite diverse organisms, how- 
ever, for this example they are clearly more similar to 
each other (than to the bacterium) and do not belong to 
the set of mammals. Therefore, our inspection involved 
checking for three basic subtrees: one for the mammals, 
one for the worm and fruit fly, and a subtree containing 
only the bacterium. In other words, the subtrees had to 
be arranged similarly to those of NCBFs taxonomy tree 
shown in Figure 14. 

In Figures 15 through 18, we note the phylogenetic 
trees from each spectrum set. By inspection, the closest 
trees to the one in Figure 14 are from the CCGGAT 
and AAA TTT spectrum sets, Figures 15 and 16, respec- 
tively. Both of these trees show that the bacterium is 
most evolutionarily distant from rest of the organisms 
and that the fruit fly and the worm form a subtree 
which is distinct from that of the mammals. The loca- 
tions of the subtrees in both figures are in the same 
configuration as illustrated in NCBFs taxonomy tree 
however, the tree of the AA47TT-spectrum set is not as 




Figure 12 The /\/\7TCG-Spectrum set test: The genomes or chromosomes are analyzed by base composition to determine the expected 
clustering behavior of their contigs. 
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Table 3 Genomes used in the test 



Locus Organism Common Name 

NC_003279 Caenorhabditis elegons chrml Worm 

NC_006583 Canis lupus familiaris chrm 1 Dog 

NT_033779 Drosophila melanogaster chrm 2L Fruit fly 

NT_039169 Mus musculus chrm 1 genomic contig Mouse 

NC_016829 Mycoplasma hyorhinis GDL-1 chrm 1 Bacteria 

NW_003 159226 Oryctolagus cuniculus breed Thorbecke inbred chrml Rabbit 

NW_047544 Rattus norvegicus chrm 1 Rat 

The organisms used in the base composition experiment. We note that rabbit, dog, mouse and rat are seemingly more closely related than the bacteria, fruit fly 
and the worm. This observation is used as a first-glance assessment of the heatmaps below. 



accurate as that of the CCGGA T-set due to the dis- 
played shorter evolutionary distances (for instance, 
longer branch lengths indicate more distance). In addi- 
tion, the distance between rat and mouse is expectedly 
closer by the CCGGA T-spectrum set than by the 
AAATTT-set 

The tree from the AATTCG -spectrum set in Figure 17 
shows that the bacterium is evolutionarily found between 
the mammal's subtree and that of the worm and fruit fly. 
This is inaccurate by the taxonomy tree Figure 14. In addi- 
tion, the tree from the CCCGGG-spectrum set (Figure 18) 
is also inaccurate since it shows that the fruit fly is closely 
related to the rabbit. These results confirm our earlier 
findings that the choice of the correct spectrum set is of 
paramount importance for a successful analysis. 

Since the spectrum set motifs were originally inspired 
from palindromic restriction sites, we also studied the 
proportions of an exhaustive list of length-6 palindromic 
motifs (64 in total) across the sequence data. Interest- 
ingly, although palindromic motifs have been known to 

/ \ 

cellular organisms 

E Mycoplasma hyorhinis 

\h Codomata 
gjEcdysozoa 

I e Caenorhabditis elegans 
I s Drosophila melanogaster 

\k Euthcria 
ijGlircs 

I B Oryctolagus cuniculus 
I a Murinac 

Fi Rattus norvegicus (Rattus norvegicus ) 

B Mus musculus 
E Canis lupus familiaris 

Figure 14 NCBI's Taxonomy Tree used for validation and 
comparison. This phylogenetic tree was used to compare the 
results of the spectrum set analysis of the organisms listed in Table 
3. We ranked the results on a scale of highest to lowest 
resemblance in Table 4. 

v / 



successfully differentiate chromosomes, as shown in 
[14]. However in Figure 19, we note that palindromes 
do not successfully recreate the taxonomy tree from 
Figure 14. 

To summarize these results, we offer Table 4 which 
contains the highest to lowest resemblance to the tree in 
Figure 14. We note from their ranking that the spectrum 
sets do not behave uniformly and that further study is 
required to understand how they should be applied to a 
particular set of organismal data for classification. 

Conclusion 

As in playing with jigsaw puzzles, if there are the pieces 
of several different puzzles in the same box, then the 
completion of any one of the puzzles is a sizable under- 
taking. In the same way, during a sequence assembly 
task where the contigs of different organisms are mixed 
together in the pool, much time can be spared by first 
sorting the contigs into their own bins from which to 
work. Our method places many of the unknown contigs 
into their corresponding bins to drastically reduce the 
search space. 

Most of this paper discussed working with contigs 
which are typically longer than reads. Our base compo- 
sition analysis tool works by quantifying the amount the 
spectrum set motifs which are contained in the 
sequence data. When there is not enough sequence 
data, then our method may produce erroneous results 
and so we suggest using contigs of at least 1000 bps 
because they should contain enough sequence informa- 
tion for a good analysis. We should mention here that 
we have had very good results when using contigs of 
700 bps in length and so 1000 bps is not always abso- 
lutely necessary. 

By sorting the contigs with related sequence data which 
is based on motif proportions, our method aims to 
accomplish the binning task. We used heatmaps to show 
the contig clusters by organism-types. Furthermore, we 
illustrated that there are only four spectrum sets, which 
can be created from length-6 recognition sites to apply to 
differentiate by contrasting the sequence data. For 
instance, we used the AAATTT and CCCGGG-spectrum 
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Figure 15 The CCGGA 7-spectrum set. This tree perfectly resembles the taxonomy tree of Figure 14 and shows the great evolutionary distances 
between the organisms. The rat and mouse are found to be closely related. We note tree distinct subtrees: one containing the bacterium, one 
for the mammals and one containing the worm and fruit fly. The location of these subtrees conforms to taxonomy tree. 




Figure 16 The /W\77T-spectrum set. This tree also resembles the taxonomy tree but there is a slight distance between mouse and rat which is 
not found in Figure 15. We note tree distinct subtrees conforming to the taxonomy tree. 






Figure 17 The /\/\7TCG-spectrum set. We note that mouse and rat are not closely related. The bacterium is also evolutionarily located between 
the mammals and the subtree containing the worm and fruit fly. 




Figure 18 he CCCGGG-s pectrum set. This tree is inaccurate because it indicates that the rabbit and the fruit fly are closely related. 



sets to show that one set had high proportional values in 
one organism, but not the other. This created the con- 
trast that would help to bin the contigs of these two 
organisms. We then showed how the contigs of three 
organisms can be binned in a two-step process. We first 
removed the most contrasting set of contigs in the pool 
and then reapply our method to the remaining contigs. 



An analysis by base composition can also be used to 
determine evolutionary orders of organismal sequence 
data. For instance, we showed that our method could cre- 
ate phylogenetics trees which were very similar to those 
produced by NCBI's taxonomy tool. 

One of the leading benefits to our method is that 
there is no setup required as there would be for other 
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Figure 19 Length-6 palindromic spectrum set. Here we note that this tree does not conform well to the validation tree in Figure 14. Rat and 
mouse are shown to be closely related but inaccurately, the tree shows that the bacterium and the worm are also closely related. 



Table 4 Ranking of sets 



Ranking 



Set Seed 



Figure 



CCGGAT 
AAAm 
AATOG 
Palindromes, Length-6 
CCCGGG 



15 
16 
17 
19 
18 



Ranking of Spectrum Sets over Chromosomal Data: We note the best to worse resemblance of spectrum set trees to actual taxonomy data. For this data set, the 
CCGGA 7"-spectrum set created a tree which most closely resembled the one based on the classification in NCBI taxonomy database in Figure 14. 



sequence recognition softwares such as BLAST [31] or 
BLAT [32]. While these methods provide powerful 
sequence analysis, they require expansive hardware 
requirements for use (i.e., memory, storage and fast 
computational power). Our method is a statistical 
approach, programmed in Python to run on basic hard- 
ware and does not require a database for operation. 

Our goals for the future are to test this base composition 
framework using synthetic and biological data to further 
analyze its performance and levels of sensitivity. This work 
will be conducted using MetaSim, to generate contigs of a 
10 X coverage for two or more genomes which we shall 
apply to our binning method. This study will help to give 
us a more realistic interpretation of its power for discrimi- 
nating contigs and how best to use it as a pre-processing 
step to sequence assembly. 
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